Add a CI check to validate IP address present in tests, and a helper to clean them #165
Conversation
Just a friendly thought, from when I think about anonymizing IPs and parsing:
Does it make sense to consider swapping the first subnet as follows,
and keep everything else the same?
The parsing should work the same for any kind of address, so if changing the octet size fails, it would be great to catch the error earlier.
Case in point: additionally, I have had bad luck with parsing where a column would be based on 15 characters, as an example, and then when reduced would run into issues such as:
and creates a state that would never exist in the real world, but can only exist in the testing infra. So which wins? The correct parsing strategy of 15 characters (imagine a real scenario where this is the 4th column and all columns are based on spacing), or the incorrect reality that was created by auto-anonymizing without spatial awareness? Long story short, parsing is a fickle beast and I have been personally bitten by this one several times.
I'm not as concerned as @itdependsnetworks about specifically preserving exact character counts, but I do agree that it would be good to keep some uniqueness across different IPs rather than making all addresses identical. I'd vote for something like just changing the first octet of IPv4 addresses to
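The first-octet idea above can be sketched as follows. This is only an illustration, not the PR's actual implementation; the comment leaves the exact replacement value open, so `198` and the function name here are assumptions.

```python
import re

# Capture the first octet separately from the rest of the address.
IPV4_RE = re.compile(r"\b(\d{1,3})\.((?:\d{1,3}\.){2}\d{1,3})\b")

# Hypothetical replacement value; the comment above does not pin one down.
NEW_FIRST_OCTET = "198"

def anonymize_first_octet(text: str) -> str:
    """Swap only the first octet of each IPv4 address, keeping the last
    three octets so distinct addresses stay distinct after anonymization."""
    return IPV4_RE.sub(lambda m: f"{NEW_FIRST_OCTET}.{m.group(2)}", text)
```

This preserves the uniqueness the comment asks for, though not the original character counts.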
report = ""
try:
    for line in fileinput.input(filename, inplace=True):
        if any(pattern_to_skip in line for pattern_to_skip in PATTERNS_TO_SKIP):
A bit surprising to me that PATTERNS_TO_SKIP is a list of exact substring matches, rather than a list of regex patterns.
It could be changed to regex if at some point it is needed. For the current need, the simple substring matching was good enough.
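The swap discussed here would not need to change any call sites. A minimal sketch of both forms, with hypothetical pattern values (the real PATTERNS_TO_SKIP list lives in the PR's script):

```python
import re

# Hypothetical skip-list entries, for illustration only.
PATTERNS_TO_SKIP = ["example.com", "0.0.0.0"]

def should_skip_substring(line: str) -> bool:
    """Current form: exact substring matching."""
    return any(pattern in line for pattern in PATTERNS_TO_SKIP)

# Possible future form: precompiled regexes; re.escape keeps the
# existing entries behaving as literal substrings.
REGEXES_TO_SKIP = [re.compile(re.escape(p)) for p in PATTERNS_TO_SKIP]

def should_skip_regex(line: str) -> bool:
    """Regex form with the same behavior for literal patterns."""
    return any(regex.search(line) for regex in REGEXES_TO_SKIP)
```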
content = test_file.readlines()
except UnicodeDecodeError as error:
    print(f"Warning: Not able to process (unknown): {error}")
    continue
sys.exit here? Or do we have files that we are expecting to have decoding errors in?
Yes, the CSV one is not supported, for now. We could solve it at some point if needed.
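The skip-instead-of-exit behavior discussed here could be isolated in a small helper. A sketch only, with hypothetical names rather than the PR's actual code:

```python
def read_lines_safely(filename: str):
    """Return the file's lines, or None for files that are not valid
    UTF-8 (e.g. binary blobs), so the caller can skip them and keep
    processing the remaining files instead of aborting the run."""
    try:
        with open(filename, encoding="utf-8") as file_handle:
            return file_handle.readlines()
    except UnicodeDecodeError as error:
        print(f"Warning: Not able to process {filename}: {error}")
        return None
```

A caller would then `continue` on a `None` result, matching the loop shown in the diff.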
I will give a try to https://github.com/intentionet/netconan |
Looks good to me.
This PR addresses issue #166 by adding a helper script, anonymize-ip-addresses.py, to be used in the CI to detect the presence of some IPv4 or IPv6 addresses, and to actually replace them with documentation IPs.

Content:
- anonymize-ip-addresses.py, to check for IPv4 and IPv6 addresses, detecting them and replacing them with documentation ones
- check-anonymize-ips and clean-anonymize-ips
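The check/clean split described above could be sketched as two functions over the same detection regexes. This is an assumption-laden illustration, not the PR's code: the function names are hypothetical, the target addresses are the RFC 5737 / RFC 3849 documentation ones, and the IPv6 pattern is deliberately simplified (it does not handle `::`-compressed forms).

```python
import re

# RFC 5737 / RFC 3849 documentation addresses, assumed as targets.
DOC_IPV4 = "192.0.2.1"
DOC_IPV6 = "2001:db8::1"

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Simplified: full-form IPv6 only, no "::" compression.
IPV6_RE = re.compile(r"\b(?:[0-9a-fA-F]{1,4}:){2,7}[0-9a-fA-F]{1,4}\b")

def check(text: str) -> bool:
    """Check mode: report whether any IPv4/IPv6 address is present."""
    return bool(IPV4_RE.search(text) or IPV6_RE.search(text))

def clean(text: str) -> str:
    """Clean mode: replace each detected address with a documentation one."""
    return IPV6_RE.sub(DOC_IPV6, IPV4_RE.sub(DOC_IPV4, text))
```

In CI, the check form would fail the build when addresses are found, while the clean form rewrites files locally.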